Brazilian Aeronautics Accidents

Statistical computation and visualization (MATH-517)

Zineb AGNAOU https://github.com/ZinebAg , Fahim BECK https://github.com/FahimBeck , Yiren CAO https://github.com/yirencao , Matias JANVIN https://github.com/matiasjanvin , Salima JAOUA https://github.com/salimajaoua , Seorim PARK https://github.com/seorimpark
December 24, 2021

Introduction

Modern society has been become increasingly dependent on the use of airplanes during the past decades. Despite the impressive engineering feat that aviation represents, airplanes and other aircraft sometimes fail, occasionally fatally. Understanding the causes of failure and considering what can be done to address these is of crucial interest for regulatory bodies and aviation authorities, but also for passengers. In the following dataset, we consider an extensive collection of incidents involving aircraft of different types across Brazil in the period of time 2006-2015. Various data have been collected for each incident. In what follows, we set out to investigate common features of these aircraft incidents.

Research questions

The main issue is to understand the factors that favor a plane crash, especially in Brazil in a 10-year period. Similarities between the accidents will be sought. Perhaps would it be possible to identify risk factors thereby improving passenger safety?

Here are some more specific questions arising from the main problem that we will try to answer:

  • How are the accidents distributed on a map? Are there areas where accidents are more concentrated?

  • How do the accidents evolve through time? Is the occurrence rate constant? Are there periods with more accidents, i.e. during school vacation, summer? Do they happen at a certain time of the day?

  • Does the age of the aircraft play a role? Are old planes more prone to accidents?

  • What about the characteristics of the aircraft? Is the model, the mass or the number of engines take a part in causing a crash?

  • What are the main occurrence types or causes of accidents?

  • What can we say about damage severity of the accidents? Are there any correlations?

Approaches

  • We represented accidents by state in Brazil in order to visualize their distribution. To go deeper, we used some \(\textsf{R}\) packages for geocoding (the data having only the name of the places, but not the geographical coordinates) in order to represent each accident on a map.

  • We employed linear, logistic and quantile regression to investigate the association between aircraft lifetime, accident severity and damage level.

Sources of information / datasets

The dataset used in this report is from Kaggle provided by the CENIPA (Centro de Investigação e Prevenção de Acidentes aeronáuticos, Brazilian Open Data). The given dataset contains two files (aircrafts.csv' andoccurrences.csv’), but merging them based on the unique occurrence ID (\(occurrence\_id\)) for each aircrafts was possible, so the latter one (`aircrafts_occurrences_merged.csv’) was used to analyse the different occurrences from the aircrafts.

The file contains different information about the aircraft and the accident. First of all, the dataset contains several columns about the features of the aircrafts which encountered accidents:

  • \(aircraft_id\) - Aircraft’s identification number
  • \(registration\) - Aircraft’s registration number
  • \(operator\_id\) - Operator’s identification number
  • \(equipment\) - Type of the aircraft (Airplane, Helicopter, Glider, Ultralight, Amphibious, Airship). Some of them are unknown.
  • \(manufacturer\) - Name of the manufacturer of the aircraft
  • \(model\) - Name of the model of the aircraft
  • \(engine\_type\) - Type of the engine
  • \(engines\_amount\) - Amount of the engine (it goes from 0 to 4)
  • \(takeoff\_max\_weight (Lbs)\) - Maximum weight that the aircraft can hold for the takeoff
  • \(seatings\_amount\) - Number of available seats of the aircraft
  • \(year\_manufacture\) - Year when the aircraft was built
  • \(registration\_country\) - Country in which the aircraft was registered (mostly Brazil, but also some other neighboring countries)
  • \(registration\_category\) - Aircraft’s registration category at the moment of the occurrence
  • \(registration\_aviation\) - Aircraft’s aviation category at the moment of the occurrence
  • \(origin\_flight\) - Location of the departure of the flight.
  • \(destination\_flight\) - Planned destination of the flight.
  • \(operation\_phase\) - Operation phase at the moment of the occurrence
  • \(type\_operation\) - Type of operation at the moment of the occurrence
  • \(damage\_level\) - Level of damage of the aircraft, there are four levels (None, Light, Substantial, Destroyed). Some of them are unknown.
  • \(fatalities\_amount\) - Number of fatalities from the accident.
  • \(extraction\_day\) - Date when the data was collected.

Then, the dataset also contains columns about the occurrences: * \(classification\) - Degree of seriousness of the accident (serious incident or accident) * \(type\;of\;occurrence\) - Type of accident * \(localization\) - City where the occurrence happened * \(fu\) - State where the occurrence happened * \(country\) - Country where the occurrence happened (mostly Brazil, but also some other neighboring countries) * \(aerodrome\) - Aerodrome’s ICOA code * \(occurrence\_day\) - Occurrence date * \(time\) - Time of the occurrence (UTC) * \(under\_investigation\) - If the accident is or went under investigation (Yes, no or unknown) * \(investigating\_command\) - Investigation command responsible for the occurrence investigation * \(investigation\_status\) - Status of the investigation * \(report\_number\) - Final report’s identification number * \(published\_report\) - If the report was published (If yes, 1) * \(publication\_day\) - Publication day of the report * \(recommendation\_amount\) - Security recommendation quantity emitted * \(aircrafts\_involved\) - Quantity of aircrafts involved in the occurrence * \(takeoff\) - * \(extraction_day\) - the date when the data was collected.

Exploratory data analysis

Table 1: Repartition of NAs in the columns
occurrence_id aircraft_id registration operator_id equipment manufacturer model engine_type engines_amount takeoff_max_weight..Lbs. seatings_amount year_manufacture registration_country registration_category registration_aviation origin_flight destination_flight operation_phase type_operation damage_level fatalities_amount classification type.of.occurrence localization fu country aerodrome occurrence_day time under_investigation investigating_command investigation_status report_number published_report publication_day recommendation_amount aircrafts_involved takeoff
0 0 0 0 0 0 0 0 9 0 18 4 0 0 0 0 0 0 0 0 1,688 0 0 0 0 0 0 0 0 0 0 0 0 1,042 0 0 0 1,787

The dataset is full of missing values to the point where there is no complete row. The Table 1 displays the number of missing values in each column. \(Takeoff\) is the top column containing NAs, with more than 87.4% of the data missing. We see also a significant number of missing data points in the columns related to the report such as \(report_number\) with more than 78.6%, \(published_report\),\(publication_day\) and \(publication_Week\) with 51%. We can see that the data is most present for columns containing information that are independent of the accident such as \(aircraft_id\), \(manufacturer\) and \(type_operation\).

When looking at the type of equipment, 78% of our data concerns airplanes. It is followed by helicopter that constitue a little less that 13% of the data. Then comes ultralight, a bit more than 7% of the dataset. The rest are airship, amphibious our unknown.

There are 120 unique manufacturers represented in our dataset. The top three manufacturers are the following: \(NEIVA INDUSTRIA AERONAUTICA, EMBRAER and AERO BOERO\) with respectively 388, 155 and 126 planes.

78% of the planes present in the dataset use a $ PISTON$ engine. 7% use \(TURBOSHAF\) engine, 6.8% use \(TURBOPROP\) and the rest \(JET\), \(WITHOUT TRACTION\) or unknown engines.

Distribution of the number of engines

Figure 1: Distribution of the number of engines

Considering Figure 1, 73.25% of planes use a single engine. 24% of planes use two engines. The rest of planes have either 3 or 4 engines. We believe that 0 refers to unknown data although there was no mention of it in the data description.

Distribution of take off max weight (in Lbs)

Figure 2: Distribution of take off max weight (in Lbs)

Figure 2 is displayed the distribution of take off max weight (in Lbs). It can be seen that that the mean weight is of 11,919.2 Lbs (5,406.4 Kg). There are though very strong outliers with planes having set their maximun weight at 630,499 Lbs (285,989.5 Kg).

Distribution of manufacturing year

Figure 3: Distribution of manufacturing year

In Figure 3 is displayed the distribution of manufacturing year. One can observe that the planes represented in the dataset were manufactered between 1936 and 2015. The year distribution is trimodal with high peaks in late 70s, early 90s and mid 00s.

Concerning the registration country, 2000 out of the 2043 constituting the data are registered in Brazil. 22 are registered in the United States of America and finally there is one accident registered in Germany, France, Poland, Russia, Saudia Arabia, South Africa, Spain and Uruguay.

37% of the accident concern \(private\) registrations. 18% come from \(instruction\) registration. The rest are either \(aerotaxi\) (13%),\(experimental\) (9.8%), \(agricultural\) (9%), \(regular\) (4%), \(specialized\) (3%) or other registrations.

19% of the accidents registered happened when landing. 17% when taking off. 11% when cruising and 9% during the run after the landing. Accidents also happen during the following phases: during maneuver (4.9%), during ascension(4.8%), final approximation (3.4%), descend (3.3%), low altitude navigation (2.8%), traffic circuit (2.5%), taxi (2%), rush on the ground (1.4%) and other less represented phases.

Concerning the damage, 72.6% of crashes were classified as \(accident\) and the rest was classified as \(serious incident\). 58% of the planes knew substantial damage after the crash. 17% are completely destroyed, 12.7% know light damage where 8% have no damage. The rest is unknown.

Distribution of fatalities

Figure 4: Distribution of fatalities

In Figure 4 is displayed the number the fatalities. It can be seen that most of the crashes involve less than 20 deaths. The deadliest crash was in a private helicopter in 1999 in Sau Paolo where 199 lost their life during the landing.

Finally, the investigation of 52% of the crashes is finished. 37% are in progress and less than 0.05% of the crashes have had their investigation reopened.

Geographical visualization of accidents

This section is devoted to the study of the places where aircraft accidents occur. Indeed, it is essential to understand this in order to identify more risk factors.

First, a comparison between the Brazilian states allows us to draw a first conclusion (Figure 5). Note that accidents are cumulative over the 10-year period.

Figure 5: Number of accidents from 2006 to 2015 by state in Brazil

All the states have less than 200 accidents, more precisely less than 170 (Rio Grande do Sul), except one state: São Paulo. This state is not larger than the others (it is rather medium in size), which indicates that there is clearly a concentration of accidents in this area. Note that ten accidents are not shown on the map (Table 2). Eight of them occurred outside of Brazil and two more for other reasons: one ended up in international waters and the other does not have its location identified.

Table 2: Accidents not reprensented on the previous map by country
Countries ARGENTINA BRAZIL COLOMBIA ENGLAND PARAGUAY PERU URUGUAY
Accidents 1 2 1 1 2 1 2

An explanation for this observation is linked in particular to demography. Indeed, the state of São Paulo is the most populous in Brazil with more than 41 million inhabitants in 2010 (Wikipedia, 2021a). This represents more than 20% of the total population of the country. Air traffic is therefore concentrated there, especially since the largest airports of the country are located in this state (for example São Paulo Guarulhos International Airport and São Paulo Congonhas Airport). The risk of accidents is therefore higher.

To be more precise in the geographical approach, it is possible to establish a map according to the location of the accident (nearest city). As the geographic coordinates are not included in the data, it is possible to use geocoding given that the city is available. For this, we used two packages: ggmap and Nominatim. An API key is required to geocode for each package (Google Maps for ggmap and MapQuest for Nominatim). As the locations are quite precise, there are often errors in the coordinates returned by geocoding, which is why we used two packages by selecting the most reasonable coordinates. It is still possible to have some deviations between the city and the place shown on the map, but the latter should not be greater than a degree of longitude and latitude.

In Figure 6, accidents are represented in clusters which are scattered by zooming. For each accident, information is available such as the model, the manufacturer, the date or the time. Note that 11 more accidents have been removed compared to the last map. The reason is that the location has not been identified.

Figure 6: Number of accidents from 2006 to 2015 clustered by location in Brazil

We still notice the same thing, that is to say a concentration of accidents in the Southeast Region of Brazil. This is explained by the fact that the traffic there is the most dense, as this article confirms in particular (Oliveira et al., 2020). It turns out that some cities are noteworthy in terms of the number of accidents. Table 3 represents the ten cities with the most accidents.

Table 3: Top ten cities with the highest number of accidents
Cities Rio De Janeiro São Paulo Goiânia Brasília Manaus Belo Horizonte Campo Grande Londrina Bragança Paulista Porto Alegre
Accidents 65 50 42 31 29 26 24 23 22 22

A demographic explanation is still reasonable. The first six cities in Table 3 are among the ten most populous cities in Brazil (Wikipedia, 2021b). Moreover, they also have airports on the list of the 20 busiest airports in Brazil (Wikipedia, 2021c).

Looking more closely at the accidents in these cities, we see that all kinds of aircraft are represented, whether they are airliners, tourist planes or even helicopters. To learn more in this direction, it is possible to further explore the accidents on the map (Figure 6).

Time analysis

Visualization

Distribution of Accidents throughout the hours of the day

Figure 7: Distribution of Accidents throughout the hours of the day

In Figure 7 are displayed the distribution of accidents throughout the hours of the day. The distribution is bimodal and it can be observed that most of the crashes happen either 2pm or 8pm.

Distribution of Accidents throughout days

Figure 8: Distribution of Accidents throughout days

In Figure 8 are displayed the distribution of accidents throughout days. we can see that there is an increasing trend.

of time difference between the occurance and the publication date

Figure 9: of time difference between the occurance and the publication date

In Figure 9 are displayed the distribution of time difference in days between the occurance day and publication day. It can be seen that most of crashes are reported the same year. The longest time difference is of 3589 days. It was for a political operation caused by a loss of control in the air in year 1986 in Brazil.

Lifetime of aircraft

In this section, we consider whether there is any association between the age of an aircraft and the severity and damage level of the incident. To begin, we have plotted the cumulative distribution function of the aircrafts’ age in Figure 10 below.

Cumulative distribution function of the aircraft's lifetime, denoted by $X$

Figure 10: Cumulative distribution function of the aircraft’s lifetime, denoted by \(X\)

We observe that more than 80% of the aircraft are between 0 and 40 years old at the time of the incident, with a small number of aircraft reaching nearly 80 years.

Next, we provide a scatter plot to visually inspect the association between age and damage level in (11)

Scatter plot of damage level (coded on a 0-3 scale) versus aircraft lifetime in recorded incidents

Figure 11: Scatter plot of damage level (coded on a 0-3 scale) versus aircraft lifetime in recorded incidents

and likewise between aircraft lifetime damage level (outcome) in (12)

Scatter plot of severity versus aircraft lifetime in recorded incidents

Figure 12: Scatter plot of severity versus aircraft lifetime in recorded incidents

To investigate the marginal association between age and damage level we perform regression analysis using a linear model, logistic model in addition to quantile regression.

The fit diagnostic for the linear regression of damage level on lifetime of the plane gives


Call:
lm(formula = df.damage.age$damage_level ~ df.damage.age$lifetime)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8694  0.1309  0.1334  0.1356  1.1392 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)             1.8693569  0.0341994  54.661   <2e-16 ***
df.damage.age$lifetime -0.0001235  0.0012004  -0.103    0.918    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7949 on 1884 degrees of freedom
Multiple R-squared:  5.621e-06, Adjusted R-squared:  -0.0005252 
F-statistic: 0.01059 on 1 and 1884 DF,  p-value: 0.918

Here, we have coded the damage level on a scale of 0 to 3. Furthermore, the confidence interval is given by

                                beta        2.5 %      97.5 %
(Intercept)             1.8693568526  1.802284164 1.936429541
df.damage.age$lifetime -0.0001235299 -0.002477756 0.002230697

which is small and includes the null value of the slope.

Likewise, the logistic regression of aircraft lifetime on severity (with corresponding confidence intervals) is given below:


Call:
glm(formula = classification ~ lifetime, family = "binomial", 
    data = df.damage.age)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8589  -0.7911  -0.7661   1.5910   1.6827  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.13355    0.09886 -11.466   <2e-16 ***
lifetime     0.00413    0.00342   1.208    0.227    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2171.2  on 1885  degrees of freedom
Residual deviance: 2169.8  on 1884  degrees of freedom
AIC: 2173.8

Number of Fisher Scoring iterations: 4
(Intercept)    lifetime 
   0.321890    1.004138 
                  OR     2.5 %   97.5 %
(Intercept) 0.321890 0.2646635 0.390004
lifetime    1.004138 0.9974175 1.010886

Once again, the confidence interval includes the null value. Finally, we consider the quantile regression coefficient from the quantile regression model \[ Q_Y(\tau\mid X) = a_0(\tau) + b_0(\tau)X \] where the outcome \(Y\) is airplane lifetime and we take accident severity as an exposure \(X\). The resulting regression coefficient is plotted in Figure (13)

Quantile regression of damage level on the quantile of aircraft lifetime

Figure 13: Quantile regression of damage level on the quantile of aircraft lifetime

The quantile plot shows that the conditional cumulative distribution function, conditioning on severity level ‘serious incident,’ is narrower compared to the conditional cumulative distribution function, conditioning on severity level ‘accident.’ In other words, both high and low quantiles are shifted towards the median.

As we did not find any strong associations between damage level and age of the aircraft marginally in the population. This motivated us to examine further whether such associations could exist within subsets of the population, such as the strata of incidents involving helicopters. Once again, we perform a logistic regression, which yields the following fit diagnostic and confidence intervals:


Call:
glm(formula = classification ~ lifetime, family = "binomial", 
    data = df.helicopters)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8589  -0.7911  -0.7661   1.5910   1.6827  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.13355    0.09886 -11.466   <2e-16 ***
lifetime     0.00413    0.00342   1.208    0.227    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2171.2  on 1885  degrees of freedom
Residual deviance: 2169.8  on 1884  degrees of freedom
AIC: 2173.8

Number of Fisher Scoring iterations: 4
(Intercept)    lifetime 
   0.321890    1.004138 
                  OR     2.5 %   97.5 %
(Intercept) 0.321890 0.2646635 0.390004
lifetime    1.004138 0.9974175 1.010886

We do not find a strong association between accident severity and aircraft age within the stratum of helicopters either (the confidence interval for the lifetime coefficient includes the null-value, and the p-value of the coefficient is 0.227).

Features of the aircrafts, possible source of accidents

Occurence type

Other features

Different features of aircrafts are presented in the dataset, and some types of them might have caused the most accidents among all. An interactive plot (link) involving pie charts of the number of counts of different features is made to compare and determine which types of features have the most occurrences. The columns \(equipment\), \(manufacturer\), \(model\), \(engine_type\), \(engines_amount\), \(registration_aviation\), \(operation_phase\), \(type_operation\).

(Different screenshots of the pie charts and short analysis (the largest&trend))

The correlation between the different features is interesting to observe, since we can omit certain columns to analyse the type of the features of aircrafts that encountered the most accidents. However there are some subtleties in analyzing the correlation because the variables are categorical. As most of the variables are not ordered, several columns suspected to be correlated are chosen.

The aim is to see if between the two chosen columns, one is a further classification of the other one. For instance, we want to check if aircrafts of the same model are from the same manufacturer. In other words, we want to check if the column \(model\) is a further classification of the column \(manufacturer\). Mainly, we want to check if the aircraft models have a certain type of fixed features, because if they are, those features can be omitted and the information about the model is enough for the analysis.

To check, let A and B the columns. Then the count of categories of A is counted for each categories of B, and the sum of the number of outliers (all counts except the largest count from A) is computed. We made several hypotheses and checked if they were correct:

  • Are the aircrafts from the same model belonging to the same type of aircraft?
  • Do the aircrafts from the same model come from the same manufacturers?
  • Do the aircrafts from the same model have the same maximum weight for the takeoff?
  • Do the aircrafts from the same model have the same number of available seatings?
  • Do the aircrafts from the same model have the same number of engines?
  • Do the aircrafts from the same model have the same type of engines?
  • Do the aircrafts from the same model registered as to have the same purpose?
Table 4: the pair of columns chosen and its number of outliers
Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
Number_of_outliers 7 126.857 143.798 5 7.5 192 392

Table 4 shows the number of outliers computed from each pairs of columns. We can observe that on one hand, there are several columns that are indeed highly correlated to the model, such as the type of aircraft, the amount of engines and the type of engines. On the other hand, there are columns that do not often match with the one of the model: the \(registration_aviation\) has the most outliers among all. We can conclude that although some columns can be omitted, the others are still important to consider. For instance, for \(registration_aviation\), almost 20% of the cases are outliers, for \(takeoff_max_weight (Lbs)\) and \(seatings_amount\) about 10% are outliers, and for \(manufacturer\) about 5% are outliers. We can see that quite often the aircrafts users changed some features from the original aircraft model.

The maximum weight for the takeoff and the number of available seats of the aircrafts have also been used for the analysis.

Different plots

Figure is a scatter plot of the count of the accidents of each the maximum weights. We can observe that

Figure is a scatter plot of the count of the accidents of each number of available seatings.

include analysis about the fatalities? or give this part to someone else

Conclusion

  • By visualizing the accidents on a map, we found that there is a concentration of accidents in the state of São Paulo, or more generally in the whole Southeast Region of Brazil. The possible explanations are linked to demography or the high density of air traffic. As air activity is higher in these areas, the risk of an accident to occur increases.

  • We employed linear, logistic and quantile regression to investigate the association between aircraft lifetime, accident severity and damage level, but did not find any statistically significant associations.

Future improvements

Oliveira, R. P., Oliveira, A. V. M., Lohmann, G., & Bettini, H. F. A. J. (2020). The geographic concentrations of air traffic and economic development: A spatiotemporal analysis of their association and decoupling in brazil. Journal of Transport Geography, 87, 102792. https://doi.org/10.1016/j.jtrangeo.2020.102792
Wikipedia. (2021a). List of brazilian states by population — Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/List_of_Brazilian_states_by_population
[Online; accessed 18-December-2021]
Wikipedia. (2021b). List of cities in brazil by population — Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/List_of_cities_in_Brazil_by_population
[Online; accessed 18-December-2021]
Wikipedia. (2021c). List of the busiest airports in brazil — Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/List_of_the_busiest_airports_in_Brazil
[Online; accessed 18-December-2021]

References